robot gripper
How well can LLMs provide planning feedback in grounded environments?
Learning to plan in grounded environments typically requires carefully designed reward functions or high-quality annotated demonstrations. Recent works show that pretrained foundation models, such as large language models (LLMs) and vision language models (VLMs), capture background knowledge helpful for planning, which reduces the amount of reward design and demonstrations needed for policy learning. We evaluate how well LLMs and VLMs provide feedback across symbolic, language, and continuous control environments. We consider prominent types of feedback for planning including binary feedback, preference feedback, action advising, goal advising, and delta action feedback. We also consider inference methods that impact feedback performance, including in-context learning, chain-of-thought, and access to environment dynamics. We find that foundation models can provide diverse high-quality feedback across domains. Moreover, larger and reasoning models consistently provide more accurate feedback, exhibit less bias, and benefit more from enhanced inference methods. Finally, feedback quality degrades for environments with complex dynamics or continuous state spaces and action spaces.
Collision-inclusive Manipulation Planning for Occluded Object Grasping via Compliant Robot Motions
Ren, Kejia, Wang, Gaotian, Morgan, Andrew S., Hang, Kaiyu
Traditional robotic manipulation mostly focuses on collision-free tasks. In practice, however, many manipulation tasks (e.g., occluded object grasping) require the robot to intentionally collide with the environment to reach a desired task configuration. By enabling compliant robot motions, collisions between the robot and the environment are allowed and can thus be exploited, but more physical uncertainties are introduced. To address collision-rich problems such as occluded object grasping while handling the involved uncertainties, we propose a collision-inclusive planning framework that can transition the robot to a desired task configuration via roughly modeled collisions absorbed by Cartesian impedance control. By strategically exploiting the environmental constraints and exploring inside a manipulation funnel formed by task repetitions, our framework can effectively reduce physical and perception uncertainties. With real-world evaluations on both single-arm and dual-arm setups, we show that our framework is able to efficiently address various realistic occluded grasping problems where a feasible grasp does not initially exist.
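As a rough illustration of how Cartesian impedance control can absorb a collision, the sketch below uses the textbook spring-damper law: a virtual wrench is computed from the pose and velocity error and mapped to joint torques through the Jacobian transpose. The gains, the 6-D error representation, and the function name are assumptions made for illustration, not the authors' controller.

```python
import numpy as np

def impedance_torques(jacobian, pose_error, velocity_error, stiffness, damping):
    """Textbook Cartesian impedance law (illustrative, not the paper's controller).

    jacobian:       (6, n_joints) geometric Jacobian at the current configuration
    pose_error:     (6,) desired minus actual end-effector pose (position + small-angle rotation)
    velocity_error: (6,) desired minus actual end-effector twist
    stiffness, damping: (6, 6) gain matrices (hypothetical values chosen by the user)
    """
    jacobian = np.asarray(jacobian, dtype=float)
    # Virtual spring-damper wrench: low stiffness makes contact forces stay small.
    wrench = stiffness @ np.asarray(pose_error, dtype=float) \
        + damping @ np.asarray(velocity_error, dtype=float)
    # Map the Cartesian wrench to joint torques.
    return jacobian.T @ wrench

# Example: a 1 cm position error along x with a soft 100 N/m spring.
tau = impedance_torques(
    jacobian=np.eye(6),
    pose_error=[0.01, 0, 0, 0, 0, 0],
    velocity_error=np.zeros(6),
    stiffness=100.0 * np.eye(6),
    damping=10.0 * np.eye(6),
)
```

Lower stiffness means a modeling error in where the collision occurs turns into a small contact force rather than a large one, which is the property a collision-inclusive planner can exploit.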
Automating Robot Failure Recovery Using Vision-Language Models With Optimized Prompts
Chen, Hongyi, Yao, Yunchao, Liu, Ruixuan, Liu, Changliu, Ichnowski, Jeffrey
Current robot autonomy struggles to operate beyond the assumed Operational Design Domain (ODD), the specific set of conditions and environments in which the system is designed to function, while the real world is rife with uncertainties that may lead to failures. Automating recovery remains a significant challenge. Traditional methods often rely on human intervention to manually address failures or require exhaustive enumeration of failure cases and the design of specific recovery policies for each scenario, both of which are labor-intensive. Foundational Vision-Language Models (VLMs), which demonstrate remarkable common-sense generalization and reasoning capabilities, have broader, potentially unbounded ODDs. However, limitations in spatial reasoning continue to be a common challenge for many VLMs when applied to robot control and motion-level error recovery. In this paper, we investigate how optimizing visual and text prompts can enhance the spatial reasoning of VLMs, enabling them to function effectively as black-box controllers for both motion-level position correction and task-level recovery from unknown failures. Specifically, the optimizations include identifying key visual elements in visual prompts, highlighting these elements in text prompts for querying, and decomposing the reasoning process for failure detection and control generation. In experiments, prompt optimizations significantly outperform pre-trained Vision-Language-Action Models in correcting motion-level position errors and improve accuracy by 65.78% compared to VLMs with unoptimized prompts. Additionally, for task-level failures, optimized prompts enhanced the success rate by 5.8%, 5.8%, and 7.5% in VLMs' abilities to detect failures, analyze issues, and generate recovery plans, respectively, across a wide range of unknown errors in Lego assembly.
Affordance-Guided Reinforcement Learning via Visual Prompting
Lee, Olivia Y., Xie, Annie, Fang, Kuan, Pertsch, Karl, Finn, Chelsea
Robots equipped with reinforcement learning (RL) have the potential to learn a wide range of skills solely from a reward signal. However, obtaining a robust and dense reward signal for general manipulation tasks remains a challenge. Existing learning-based approaches require significant data, such as demonstrations or examples of success and failure, to learn task-specific reward functions. Recently, there has also been growing adoption of large multi-modal foundation models for robotics. These models can perform visual reasoning in physical contexts and generate coarse robot motions for various manipulation tasks. Motivated by this range of capability, in this work, we propose and study rewards shaped by vision-language models (VLMs). State-of-the-art VLMs have demonstrated an impressive ability to reason about affordances through keypoints in a zero-shot manner, and we leverage this to define dense rewards for robotic learning. On a real-world manipulation task specified by natural language description, we find that these rewards improve the sample efficiency of autonomous RL and enable successful completion of the task in 20K online finetuning steps. Additionally, we demonstrate the robustness of the approach to reductions in the number of in-domain demonstrations used for pretraining, reaching comparable performance in 35K online finetuning steps.
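One way keypoints could be turned into a dense reward is sketched below. It assumes the VLM has already returned an ordered list of waypoints for the end effector; the reward then combines progress along that list with distance to the nearest waypoint. The reward form and the function name are hypothetical and are not taken from the paper.

```python
import numpy as np

def dense_waypoint_reward(position, waypoints):
    """Hypothetical dense reward from an ordered list of VLM-proposed waypoints.

    The reward increases with the index of the nearest waypoint (progress along
    the path) and decreases with the distance to that waypoint.
    """
    position = np.asarray(position, dtype=float)
    waypoints = np.asarray(waypoints, dtype=float)
    dists = np.linalg.norm(waypoints - position, axis=1)
    nearest = int(np.argmin(dists))
    # Normalize progress to [0, 1]; guard against a single-waypoint path.
    progress = nearest / max(len(waypoints) - 1, 1)
    return progress - dists[nearest]

# Three waypoints along the x axis (units are arbitrary here).
path = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
reward_start = dense_waypoint_reward([0.0, 0.0], path)
reward_mid = dense_waypoint_reward([1.0, 0.1], path)
reward_goal = dense_waypoint_reward([2.0, 0.0], path)
```

Unlike a sparse success signal, this kind of reward gives the policy feedback at every step, which is the property the abstract credits for improved sample efficiency.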
ContactHandover: Contact-Guided Robot-to-Human Object Handover
Wang, Zixi, Liu, Zeyi, Ouporov, Nicolas, Song, Shuran
Robot-to-human object handover is an important step in many human-robot collaboration tasks. A successful handover requires the robot to maintain a stable grasp on the object while making sure the human receives the object in a natural and easy-to-use manner. We propose ContactHandover, a robot-to-human handover system that consists of two phases: a contact-guided grasping phase and an object delivery phase. During the grasping phase, ContactHandover predicts both 6-DoF robot grasp poses and a 3D affordance map of human contact points on the object. The robot grasp poses are reranked by penalizing those that block human contact points, and the robot executes the highest-ranking grasp. During the delivery phase, the robot end effector pose is computed by maximizing human contact points close to the human while minimizing the human arm joint torques and displacements. We evaluate our system on 27 diverse household objects and show that our system achieves better visibility and reachability of human contacts to the receiver compared to several baselines. More results can be found at https://clairezixiwang.github.io/ContactHandover.github.io
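The reranking step can be pictured with the minimal sketch below. It assumes each candidate grasp has a scalar quality score and a 3D gripper position, and that the affordance map is a set of weighted contact points; a grasp is penalized for every contact point within a blocking radius. The radius, the penalty weight, and all names are assumptions for illustration, not the ContactHandover implementation.

```python
import numpy as np

def rerank_grasps(grasp_scores, grasp_positions, contact_points, contact_weights,
                  block_radius=0.03, penalty_weight=1.0):
    """Sort grasps best-first after penalizing those that block human contact points.

    block_radius and penalty_weight are hypothetical tuning parameters.
    """
    grasp_scores = np.asarray(grasp_scores, dtype=float)
    grasp_positions = np.asarray(grasp_positions, dtype=float)
    contact_points = np.asarray(contact_points, dtype=float)
    contact_weights = np.asarray(contact_weights, dtype=float)

    # Pairwise distances between grasps and contact points: (n_grasps, n_contacts).
    dists = np.linalg.norm(
        grasp_positions[:, None, :] - contact_points[None, :, :], axis=-1
    )
    # A contact point counts as blocked when the gripper sits within block_radius of it.
    blocked = dists < block_radius
    penalties = (blocked * contact_weights[None, :]).sum(axis=1)

    adjusted = grasp_scores - penalty_weight * penalties
    order = np.argsort(-adjusted, kind="stable")
    return order, adjusted

# Grasp 0 scores higher but sits on top of a likely human contact point.
order, adjusted = rerank_grasps(
    grasp_scores=[0.9, 0.8],
    grasp_positions=[[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]],
    contact_points=[[0.0, 0.0, 0.01]],
    contact_weights=[0.5],
)
```

In this example the second grasp is ranked first even though its raw score is lower, because it leaves the predicted human contact region free.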
Evaluating online elasticity estimation of soft objects using standard robot grippers
Patni, Shubhan P., Stoudek, Pavel, Chlup, Hynek, Hoffmann, Matej
Standard robot grippers are not designed for elasticity estimation. In this work, a professional biaxial compression device was used as a control setup to study the accuracy with which material properties can be estimated by two standard parallel jaw grippers and a force/torque sensor mounted at the robot wrist. Using three sets of deformable objects, different parameters were varied to observe their effect on measuring material characteristics: (1) repeated compression cycles, (2) compression speed, and (3) the surface area of the gripper jaws. Gripper effort versus position curves were obtained and transformed into stress/strain curves. The modulus of elasticity was estimated at different strain points. Viscoelasticity was assessed using the energy absorbed in a compression/decompression cycle and the Kelvin-Voigt and Hunt-Crossley models. Our results can be summarized as follows: (1) better results were obtained with slower compression speeds, while additional compression cycles or surface area did not improve estimation; (2) the robot grippers, even after calibration, were found to have a limited capability of delivering accurate estimates of absolute values of Young's modulus and viscoelasticity; (3) relative ordering of material characteristics was largely consistent across different grippers; (4) despite the nonlinear characteristics of deformable objects, fitting linear stress/strain approximations led to more stable results than local estimates of Young's modulus; (5) to assess viscoelasticity, the Hunt-Crossley model worked best. Finally, we show that a two-dimensional space representing elasticity and viscoelasticity estimates is advantageous for the discrimination of deformable objects. Single-grasp, online classification and sorting of such objects is thus possible. An additional contribution is the dataset and data processing codes that we make publicly available.
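The conversion from gripper readings to stress/strain and the two modulus estimates compared in the abstract can be sketched as follows. The sketch assumes the gripper effort has already been calibrated to a force in newtons and that the jaw contact area is known; the function names are illustrative and this is not the authors' published processing code. The Kelvin-Voigt and Hunt-Crossley functions use the standard textbook forms of those contact models.

```python
import numpy as np

def stress_strain(force_n, jaw_gap_m, initial_gap_m, contact_area_m2):
    """Convert force and jaw-gap readings into engineering stress (Pa) and compressive strain."""
    stress = np.asarray(force_n, dtype=float) / contact_area_m2
    strain = (initial_gap_m - np.asarray(jaw_gap_m, dtype=float)) / initial_gap_m
    return stress, strain

def linear_fit_modulus(stress, strain):
    """Young's modulus as the slope of a least-squares line through the whole curve."""
    slope, _intercept = np.polyfit(strain, stress, 1)
    return slope

def local_modulus(stress, strain):
    """Pointwise slope d(stress)/d(strain); more sensitive to sensor noise."""
    return np.gradient(stress, strain)

def kelvin_voigt_force(k, c, x, x_dot):
    """Linear spring and damper in parallel: F = k*x + c*x_dot."""
    return k * x + c * x_dot

def hunt_crossley_force(k, lam, n, x, x_dot):
    """Nonlinear contact model: F = k*x**n + lam*x**n*x_dot."""
    return k * x**n + lam * x**n * x_dot

# Synthetic, noise-free compression of a 5 cm object with E = 50 kPa (made-up numbers).
gaps = np.linspace(0.05, 0.04, 11)
true_modulus = 5.0e4
area = 1.0e-3
forces = true_modulus * ((0.05 - gaps) / 0.05) * area
stress, strain = stress_strain(forces, gaps, initial_gap_m=0.05, contact_area_m2=area)
estimated_modulus = linear_fit_modulus(stress, strain)
```

On noisy data the fitted slope averages over all samples while `local_modulus` differentiates between neighbouring ones, which is one plausible reason the linear approximation was reported to be more stable.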
REFLECT: Summarizing Robot Experiences for Failure Explanation and Correction
Liu, Zeyi, Bahety, Arpit, Song, Shuran
The ability to detect and analyze failed executions automatically is crucial for an explainable and robust robotic system. Recently, Large Language Models (LLMs) have demonstrated strong reasoning abilities on textual inputs. To leverage the power of LLMs for robot failure explanation, we introduce REFLECT, a framework which queries an LLM for failure reasoning based on a hierarchical summary of the robot's past experiences generated from multisensory observations. The failure explanation can further guide a language-based planner to correct the failure and complete the task. To systematically evaluate the framework, we create the RoboFail dataset with a variety of tasks and failure scenarios. We demonstrate that the LLM-based framework is able to generate informative failure explanations that assist successful correction planning.
Gesture-Informed Robot Assistance via Foundation Models
Lin, Li-Heng, Cui, Yuchen, Hao, Yilun, Xia, Fei, Sadigh, Dorsa
Gestures serve as a fundamental and significant mode of non-verbal communication among humans. Deictic gestures (such as pointing towards an object), in particular, offer valuable means of efficiently expressing intent in situations where language is inaccessible, restricted, or highly specialized. As a result, it is essential for robots to comprehend gestures in order to infer human intentions and establish more effective coordination with them. Prior work often relies on a rigid hand-coded library of gestures along with their meanings. However, interpretation of gestures is often context-dependent, requiring more flexibility and common-sense reasoning. In this work, we propose a framework, GIRAF, for more flexibly interpreting gesture and language instructions by leveraging the power of large language models. Our framework is able to accurately infer human intent and contextualize the meaning of their gestures for more effective human-robot collaboration. We instantiate the framework for interpreting deictic gestures in table-top manipulation tasks and demonstrate that it is both effective and preferred by users, achieving 70% higher success rates than the baseline. We further demonstrate GIRAF's ability to reason about diverse types of gestures by curating a GestureInstruct dataset consisting of 36 different task scenarios. GIRAF achieved an 81% success rate on finding the correct plan for tasks in GestureInstruct. Website: https://tinyurl.com/giraf23
Active Vapor-Based Robotic Wiper
Kiyokawa, Takuya, Katayama, Hiroki, Takamatsu, Jun
This paper presents a method for the normal estimation of mirrors and transparent objects that are difficult to recognize with a camera. To create a diffuse reflective surface, we propose spraying water vapor onto transparent or mirror surfaces. In the proposed method, we move an ultrasonic humidifier mounted on the tip of a robotic arm to apply sprayed water vapor onto the plane of a target object to form a cross-shaped misted area. Diffuse reflective surfaces are partially generated as misted areas, which allows the camera to detect the surface of the target object. The viewpoint of the gripper-mounted camera is adjusted such that the extracted misted area appears to be the largest in the image, and finally, the plane normal of the target object surface is estimated. Normal estimation experiments were conducted to evaluate the effectiveness of the proposed method. The RMSEs of the azimuth estimation for the mirror and transparent glass were approximately 4.2 and 5.8 degrees, respectively. Consequently, our robot experiments demonstrate that our robotic wiper can perform contact-force-regulated wiping motions to clean a transparent window, as humans do.
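For readers unfamiliar with the reported metric, the sketch below shows one way an azimuth RMSE could be computed from estimated and ground-truth plane normals. The azimuth convention (angle of the normal's projection in the x-y plane, measured from the x axis) and the function names are assumptions; the paper's exact convention is not stated in the abstract.

```python
import numpy as np

def azimuth_deg(normal):
    """Azimuth of a 3D normal in degrees, using an assumed x-y plane convention."""
    n = np.asarray(normal, dtype=float)
    return float(np.degrees(np.arctan2(n[1], n[0])))

def angular_rmse_deg(estimated_deg, true_deg):
    """Root-mean-square error between angle lists, wrapping differences to [-180, 180)."""
    diff = np.asarray(estimated_deg, dtype=float) - np.asarray(true_deg, dtype=float)
    # Wrap so that 359 degrees versus 1 degree counts as a 2 degree error, not 358.
    diff = (diff + 180.0) % 360.0 - 180.0
    return float(np.sqrt(np.mean(diff ** 2)))

# Made-up example values, not data from the paper.
rmse = angular_rmse_deg([359.0, 1.0], [1.0, 359.0])
az = azimuth_deg([0.0, 1.0, 0.0])
```

The wrapping step matters whenever estimates fall on either side of the 0/360 degree boundary; without it, small errors near that boundary would dominate the RMSE.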
If You Are Careful, So Am I! How Robot Communicative Motions Can Influence Human Approach in a Joint Task
Lastrico, Linda, Duarte, Nuno Ferreira, Carfì, Alessandro, Rea, Francesco, Mastrogiovanni, Fulvio, Sciutti, Alessandra, Santos-Victor, José
As humans, we have a remarkable capacity for reading the characteristics of objects only by observing how another person carries them. Indeed, how we perform our actions naturally embeds information on the item features. Collaborative robots can achieve the same ability by modulating the strategy used to transport objects with their end-effector. A contribution in this sense would promote spontaneous interactions by making an implicit yet effective communication channel available. This work investigates whether humans correctly perceive the implicit information shared by a robotic manipulator through its movements during a dyadic collaboration task. Exploiting a generative approach, we designed robot actions to convey virtual properties of the transported objects, particularly to inform the partner if any caution is required to handle the carried item. We found that carefulness is correctly interpreted when observed through the robot movements. In the experiment, we used identical empty plastic cups; nevertheless, participants approached them differently depending on the attitude shown by the robot: humans change how they reach for the object, being more careful whenever the robot does the same. This emerging form of motor contagion is entirely spontaneous and happens even if the task does not require it.